### Library Imports

```python
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal
```
### Template

```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.1 - Looking at Your Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
```
```python
import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

pets = spark.read.csv(path, header=True)
pets.toPandas()
```
|   | id | breed_id | nickname | birthday            | age | color |
|---|----|----------|----------|---------------------|-----|-------|
| 0 | 1  | 1        | King     | 2014-11-22 12:30:31 | 5   | brown |
| 1 | 2  | 3        | Argus    | 2016-11-22 10:05:10 | 10  | None  |
| 2 | 3  | 1        | Chewie   | 2016-11-22 10:05:10 | 15  | None  |
### Looking at Your Data

Spark is lazily evaluated. To look at your data you must perform a take operation, which triggers your transformations to be evaluated. There are a few ways to perform a take operation; we'll go through them here, along with their performance characteristics.

For example, toPandas() is a take operation, which you've already seen in many places.
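To see the laziness for yourself, note that defining a transformation returns immediately; nothing is computed until a take operation is called. A minimal sketch (the nickname_upper column is just an illustrative transformation, not part of the dataset):

```python
# This only records the transformation in the query plan; it returns
# instantly and reads no data.
pets_transformed = pets.withColumn("nickname_upper", F.upper(F.col("nickname")))

# Only now, when a take operation is performed, does Spark actually
# read the CSV and apply the transformation.
pets_transformed.toPandas()
```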
### Option 1 - collect()

```python
pets.collect()
```

```
[Row(id=u'1', breed_id=u'1', nickname=u'King', birthday=u'2014-11-22 12:30:31', age=u'5', color=u'brown'),
 Row(id=u'2', breed_id=u'3', nickname=u'Argus', birthday=u'2016-11-22 10:05:10', age=u'10', color=None),
 Row(id=u'3', breed_id=u'1', nickname=u'Chewie', birthday=u'2016-11-22 10:05:10', age=u'15', color=None)]
```
#### What Happened?

When you call collect() on a dataframe, it triggers a take operation, brings all the data to the driver node, and returns all rows as a list of Row objects.
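Since collect() returns ordinary Row objects, you can work with them as regular Python values. A small sketch:

```python
rows = pets.collect()

# Row fields can be accessed as attributes or by key.
for row in rows:
    print(row.nickname, row["age"])

# A Row can also be converted to a plain Python dict.
first_pet = rows[0].asDict()
```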
#### Note

This is not advised unless you need to look at all of the rows in your dataset; you should usually sample a subset of the data instead. This call will execute all of the transformations that you have specified, on all of the data.
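If you only need a peek at the data, there are cheaper alternatives, sketched below (the fraction and seed values are arbitrary):

```python
# Bring back only the first 10 rows instead of the whole dataset.
pets.limit(10).collect()

# Or collect an (approximate) 10% random sample, without replacement.
pets.sample(False, 0.1, seed=42).collect()
```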
### Option 2 - head()/take()/first()

```python
pets.head(n=1)
```

```
[Row(id=u'1', breed_id=u'1', nickname=u'King', birthday=u'2014-11-22 12:30:31', age=u'5', color=u'brown')]
```
#### What Happened?

When you call head(n) on a dataframe, it triggers a take operation and returns the first n rows of the resulting dataset. The variants differ only in how many rows they return and in what shape, as shown below.
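For reference, a quick sketch of how the variants differ:

```python
pets.head(2)    # list of the first 2 Row objects
pets.take(2)    # same as head(2)
pets.head()     # a single Row (the first row), not a list
pets.first()    # same as head()
```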
#### Note

- If the data is unsorted, Spark performs the transformations on only as many partitions as are needed to satisfy the requested number of rows. This is much more performant, especially on large datasets.
- If the data is sorted, Spark performs the same as a collect() and applies all of the transformations to all of the data.

By sorted we mean that some sorting of the data is done during the transformations, such as sort(), orderBy(), etc. (see the plan comparison below).
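One way to see this difference is to compare the physical plans: once an orderBy() is involved, a sort stage appears, so even a head(1) must process all of the data first. A sketch:

```python
# No sort: Spark can stop scanning once enough rows are found.
pets.explain()

# Sorted: the plan now contains a Sort step, so a head() on this
# dataframe has to run the transformations over all of the data.
pets.orderBy("age").explain()
```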
### Option 3 - toPandas()

```python
pets.toPandas()
```
|   | id | breed_id | nickname | birthday            | age | color |
|---|----|----------|----------|---------------------|-----|-------|
| 0 | 1  | 1        | King     | 2014-11-22 12:30:31 | 5   | brown |
| 1 | 2  | 3        | Argus    | 2016-11-22 10:05:10 | 10  | None  |
| 2 | 3  | 1        | Chewie   | 2016-11-22 10:05:10 | 15  | None  |
#### What Happened?

When you call toPandas() on a dataframe, it triggers a take operation and returns all of the rows, converted into a pandas DataFrame.

This is as performant as the collect() function (all rows are brought back to the driver), but the most readable in my opinion.
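Because all rows are brought back to the driver, it's worth limiting the dataframe first when the dataset is large. A sketch:

```python
# Convert only a handful of rows; the limit() is applied
# before the data is brought to the driver.
pets.limit(2).toPandas()
```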
### Option 4 - show()

```python
pets.show()
```

```
+---+--------+--------+-------------------+---+-----+
| id|breed_id|nickname|           birthday|age|color|
+---+--------+--------+-------------------+---+-----+
|  1|       1|    King|2014-11-22 12:30:31|  5|brown|
|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|
|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null|
+---+--------+--------+-------------------+---+-----+
```
#### What Happened?

When you call show() on a dataframe, it triggers a take operation and prints up to 20 rows of the result.

This is as performant as the head() function and more readable. (I still prefer toPandas() 😀)
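show() also accepts a few useful parameters, sketched here (vertical=True requires Spark 2.3+):

```python
# Print up to 2 rows, without truncating long values,
# rendered one column per line instead of as a table.
pets.show(n=2, truncate=False, vertical=True)
```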
### Summary

- We learnt about the various functions that allow you to look at your data.
- Some functions are less performant than others, depending on whether the resulting data is sorted.
- Refrain from looking at all of the data unless you really need to.